In [1]:
# Exploratory Data Analysis
# What is Exploratory Data Analysis (EDA)?
#  EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. It is an essential step in the data analysis process, allowing you to understand the data's structure, detect outliers, and identify patterns.
#  EDA is typically performed before any formal modeling or hypothesis testing.

# End goal of EDA:
#  - To gain insights into the data
#  - To prepare the data for further analysis or modeling

# Note: As a rough rule of thumb, EDA and data preparation take up the largest share of the time in a typical Machine Learning project (a figure of about 70% is often quoted).
# If the EDA phase is not taken seriously, the rest of the project will be flawed.
# In fact, EDA is one of the most important steps in the data analysis process. You may have to perform EDA again after the modeling phase to understand the model's performance and behavior.
# Therefore, EDA is an iterative process that may require revisiting as new insights are gained or as the data evolves.

# Step-by-step EDA process in the industry:
# 1. Data Collection: Gather data from various sources, such as databases, APIs, or files.
# 2. Data Cleaning: Handle missing values, remove duplicates, and correct inconsistencies in the data.
# 3. Data Exploration: Use descriptive statistics and visualizations to understand the data's distribution, relationships, and patterns.
# 4. Feature Engineering: Create new features or modify existing ones to improve the data's predictive power.
# 5. Data Transformation: Normalize or scale the data, if necessary, to prepare it for modeling.
# 6. Data Visualization: Create visual representations of the data to communicate findings and insights effectively.
# 7. Documentation: Document the EDA process, findings, and any decisions made for future reference.
# Real-world example using an Amazon dataset: what happens at each of the steps above:
# 1. Data Collection: Download the Amazon dataset from a public repository or API.
# 2. Data Cleaning: Check for missing values in the dataset like product reviews, remove duplicates like multiple reviews for the same product, and correct inconsistencies like different formats for dates.
# 3. Data Exploration: Use descriptive statistics to summarize the dataset, such as the average rating of products, the distribution of review lengths, and the number of reviews per product.
# 4. Feature Engineering: Create new features like the sentiment score of reviews, the length of the review text, or the time since the last review.
# 5. Data Transformation: Normalize the sentiment scores or scale the review lengths to prepare them for modeling.
# 6. Data Visualization: Create visualizations like histograms of review lengths, bar charts of average ratings per product category, or scatter plots of sentiment scores against review lengths.
# 7. Documentation: Document the EDA process, findings, and any decisions made, such as which features to include in the modeling phase or any patterns observed in the data.
# EDA is a crucial step in the data analysis process, and it is essential to take it seriously to ensure the success of any data-driven project.
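The cleaning, exploration, feature engineering, and transformation steps listed above can be sketched on a tiny review table. This is only an illustration: the `reviews` DataFrame, its column names, and all of its values are made up here and do not come from a real Amazon dataset.

```python
import pandas as pd

# Hypothetical review data (every value is invented for illustration)
reviews = pd.DataFrame({
    "product_id": ["A1", "A1", "B2", "B2", "C3"],
    "rating": [5, 5, 3, None, 4],
    "review_text": ["Great value", "Great value", "Okay", "Too small", "Works well"],
    "review_date": ["2023-01-05", "2023-01-05", "2023-02-05", "2023-02-11", "2023-03-01"],
})

# Step 2 - Data Cleaning: drop exact duplicates, parse dates, fill the missing rating
reviews = reviews.drop_duplicates().reset_index(drop=True)
reviews["review_date"] = pd.to_datetime(reviews["review_date"])
reviews["rating"] = reviews["rating"].fillna(reviews["rating"].median())

# Step 3 - Data Exploration: average rating per product
avg_rating = reviews.groupby("product_id")["rating"].mean()

# Step 4 - Feature Engineering: length of the review text in characters
reviews["review_length"] = reviews["review_text"].str.len()

# Step 5 - Data Transformation: min-max scale the review length to the range [0, 1]
length = reviews["review_length"]
reviews["review_length_scaled"] = (length - length.min()) / (length.max() - length.min())

print(avg_rating)
print(reviews[["review_text", "review_length", "review_length_scaled"]])
```

Filling the missing rating with the median is just one possible choice for this toy table; the right strategy depends on the data.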
In [2]:
# Let's do the necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
In [3]:
# Our Problem Statement - Titanic Dataset EDA
# Load the dataset from a CSV file
# https://raw.githubusercontent.com/ingledarshan/upGrad_Darshan/refs/heads/main/titanic.csv
df = pd.read_csv('https://raw.githubusercontent.com/ingledarshan/upGrad_Darshan/refs/heads/main/titanic.csv')

In [4]:
# Let's check the first 5 rows of the dataset
print("First 5 rows of the dataset:")
df.head()
First 5 rows of the dataset:
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [6]:
# Understanding the meaning of the columns in the dataset:
# PassengerId: Unique identifier for each passenger eg: 1, 2, 3, ...
# Survived: Survival status (0 = No, 1 = Yes)
# Pclass: Passenger class (1 = First class, 2 = Second class, 3 = Third class)
# Name: Name of the passenger eg: "Braund, Mr. Owen Harris",  etc.
# Sex: Gender of the passenger (male, female)
# Age: Age of the passenger in years (float)
# SibSp: Number of siblings and spouses aboard the Titanic (integer)
# Parch: Number of parents and children aboard the Titanic (integer)
# Ticket: Ticket number of the passenger (string)
# Fare: Fare paid by the passenger (float) in British pounds
# Cabin: Cabin number of the passenger (string, may contain NaN values)
# Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
In [7]:
# What is the objective of this EDA?
# The objective of this Exploratory Data Analysis (EDA) is to gain insights into the Titanic dataset, understand the factors that may have influenced passenger survival, and identify any patterns or trends in the data. This analysis will help in building predictive models and making data-driven decisions.
# The EDA will include:
# 1. Data Cleaning: Handling missing values, duplicates, and inconsistencies.
# 2. Data Exploration: Analyzing the distribution of key features, such as age, fare, and survival rates.
# 3. Feature Engineering: Creating new features or modifying existing ones to improve the dataset's predictive power.
# 4. Data Visualization: Creating visual representations of the data to communicate findings effectively.
# 5. Documentation: Documenting the EDA process, findings, and any decisions made for future reference.
In [9]:
# Let's formulate some questions (hypotheses) that we can answer using this dataset:
# 1. What is the overall survival rate of passengers on the Titanic?
# 2. How does survival rate vary by passenger class (Pclass)?
# 3. Is there a significant difference in survival rates between male and female passengers?
# 4. How does age affect the survival rate of passengers?
# 5. What is the distribution of fares paid by passengers, and how does it relate to survival?
# 6. Are there any patterns in the number of siblings/spouses (SibSp) and parents/children (Parch) aboard the Titanic?
# 7. How does the port of embarkation (Embarked) affect survival rates?
# 8. Are there any correlations between the features in the dataset, such as age, fare, and survival?
# 9. What is the distribution of cabin numbers, and how does it relate to survival?
# 10. Are there any notable trends or patterns in the names of passengers that could provide insights into their backgrounds or social status?

# Now, let's start with the EDA process by checking the shape of the dataset
print("Shape of the dataset:", df.shape)
Shape of the dataset: (891, 12)
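Questions 1 to 3 above come down to group-wise means of the `Survived` column. The sketch below shows the pattern on a toy table; `toy` and its values are invented for illustration, and only the column names match the Titanic dataset.

```python
import pandas as pd

# Toy data with Titanic-style column names (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3, 2, 2],
    "Sex":      ["male", "female", "female", "female", "male", "male", "female", "male"],
})

# Because Survived is coded 0/1, its mean is the survival rate
overall_rate = toy["Survived"].mean()                      # question 1
rate_by_class = toy.groupby("Pclass")["Survived"].mean()   # question 2
rate_by_sex = toy.groupby("Sex")["Survived"].mean()        # question 3

print(f"Overall survival rate: {overall_rate:.1%}")
print(rate_by_class)
print(rate_by_sex)
```

The same three lines should work on the real `df`, assuming `Survived` remains a 0/1 integer column.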
In [10]:
# Insights from the shape of the dataset:
# The dataset contains 891 rows and 12 columns, indicating that there are 891 passengers with 12 features each.
# This means we have a good amount of data to analyze, which can help us draw meaningful conclusions about the passengers and their survival on the Titanic.
In [11]:
# Let's check the data types of each column to understand the structure of the dataset
print("Data types of each column:")
print(df.dtypes)
Data types of each column:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [12]:
# Insights from the data types:
# The dataset contains a mix of numerical and categorical features.
# Numerical features include: Age, Fare, SibSp, Parch
# Categorical features include: Survived, Pclass, Sex, Embarked, Cabin
# (Survived and Pclass are stored as int64, but they encode categories rather than quantities.)
# Identifier or free-text columns: PassengerId, Name, Ticket
# Understanding the data types will help us determine the appropriate analysis techniques and visualizations to use.
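The split into numerical and categorical columns can be started from the dtypes, but dtype alone is not enough: `Survived` and `Pclass` are integers that act as categories. A minimal sketch, using a few columns from the first five rows shown earlier; the cardinality threshold of 3 is an arbitrary assumption chosen for this tiny sample.

```python
import pandas as pd

# A few columns from the first five rows of the Titanic dataset shown earlier
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Columns stored as numbers versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Numeric columns with very few distinct values are candidates for "categorical"
# (the threshold of 3 is an assumption that only suits this tiny sample)
low_cardinality = [col for col in numeric_cols if sample[col].nunique() <= 3]

print("Numeric dtype:", numeric_cols)
print("Non-numeric dtype:", text_cols)
print("Numeric but category-like:", low_cardinality)
```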
In [13]:
# Let's check the information about the dataset
print("Information about the dataset:")
df.info()
Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [14]:
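# Insights from df.info():
# The dataset contains a mix of data types:
# - Integer (int64): PassengerId, Survived, Pclass, SibSp, Parch
# - Float (float64): Age, Fare
# - Object (strings): Name, Sex, Ticket, Cabin, Embarked
# Only Age (714), Cabin (204), and Embarked (889) have fewer than 891 non-null values.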
In [16]:
# Let's look at the summary statistics of the dataset
print("Summary statistics of the dataset:")
df.describe()
Summary statistics of the dataset:
Out[16]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [17]:
# Let's check for missing values in the dataset
print("Missing values in each column:")
# df.isnull().sum().sort_values(ascending=False)
df.isna().sum().sort_values(ascending=False)
Missing values in each column:
Out[17]:
Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64
In [18]:
# Insights from the missing values:
# The dataset has missing values in the following columns:
# - Age: 177 missing values (19.9% of the total rows)
# - Cabin: 687 missing values (77.1% of the total rows)
# - Embarked: 2 missing values (0.2% of the total rows)
# Handling missing values is crucial for accurate analysis and modeling.
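| Function | Purpose | Equivalent? | Preferred? |
|----------|---------|-------------|------------|
| isnull() | Detects NaN/missing values | ✅ Yes | Use whichever you prefer |
| isna() | Detects NaN/missing values | ✅ Yes | Matches the naming of fillna() and dropna() |

So go with whichever you find more readable or memorable: isnull() is simply an alias of isna(), so both return the same result.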
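A quick way to convince yourself that the two names behave identically is to compare their output on the same data. The small Series below is made up for the check.

```python
import numpy as np
import pandas as pd

# Small made-up Series with two missing values
s = pd.Series([22.0, np.nan, 26.0, np.nan])

# Both methods return the same boolean mask
same_mask = s.isna().equals(s.isnull())

# Summing the mask counts the missing values
missing_count = int(s.isna().sum())

# notna() / notnull() are the corresponding inverses
inverse_ok = s.notna().equals(~s.isna())

print(same_mask, missing_count, inverse_ok)
```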

In [19]:
!pip install missingno -q
# missingno is a Python library for visualizing missing data
import missingno as msno
# Visualizing missing values using missingno
msno.matrix(df)
Out[19]:
<Axes: >
[Figure: missingno matrix of missing values in df]
In [20]:
msno.bar(df)
Out[20]:
<Axes: >
[Figure: missingno bar chart of non-missing counts per column]
In [21]:
!pip install sweetviz -q
# sweetviz is a Python library for visualizing and analyzing datasets
# It generates a report that provides insights into the dataset, including missing values, feature distributions, and correlations.
import sweetviz as sv
# Generating a report using sweetviz
report = sv.analyze(df)
# Displaying the report
report.show_html('titanic_preprofiling_report_sv.html')
# The report will be saved as an HTML file named 'titanic_preprofiling_report_sv.html'
# The sweetviz report provides a comprehensive overview of the dataset, including:
# - Feature distributions
# - Missing values
# - Correlations between features
# - Sample data
# This report can help us understand the dataset better and identify any potential issues or patterns that need to be addressed during the EDA process.
Report titanic_preprofiling_report_sv.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
In [22]:
# ydata-profiling (formerly pandas-profiling) is another library that can be used to generate a report for the dataset
# https://github.com/ydataai/ydata-profiling/blob/develop/examples/bank_marketing_data/banking_data.py
!pip install ydata-profiling -q

# pathlib provides an object-oriented interface for working with file system paths; Path represents a single path.
from pathlib import Path
from ydata_profiling import ProfileReport
profile = ProfileReport(
        df, title="Profile Report of the Titanic Dataset - Pre-Profiling", explorative=True
    )
profile.to_file(Path("titanic_preprofiling_report_pp.html"))
In [23]:
df.columns # Display the columns in the DataFrame
Out[23]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [24]:
cat_cols = ['Survived', 'Pclass', 'Sex', 'SibSp','Parch', 'Embarked']

# Let's check the unique values in each categorical column
for col in cat_cols:
    print(f"Unique values in {col}: {df[col].unique()}")

print("="*143)
# Let's check the value counts of each categorical column
for col in cat_cols:
    print(f"Value counts for {col}:")
    print(df[col].value_counts())
    print("."*143)
    print(df[col].value_counts(normalize=True)*100)  # Normalized value counts
    print("."*143)
    df[col].value_counts().plot(kind='bar', title=col)
    plt.xticks(rotation=45)
    plt.xlabel(col)
    plt.ylabel('Count')
    # Plot text on top of the bars
    for index, value in enumerate(df[col].value_counts()):
        plt.text(index, value + 5, str(value), ha='center', va='bottom')
    plt.show()

# Insights from the value counts:
# - Most passengers did not survive (0 = No, 1 = Yes), with a survival rate of approximately 38.4%.
# - The majority of passengers were in third class (Pclass = 3), followed by first class (Pclass = 1) and second class (Pclass = 2).
# - The majority of passengers were male (Sex = male), with a smaller proportion of female passengers (Sex = female).
# - Most passengers had no siblings or spouses aboard (SibSp = 0), followed by those with one sibling or spouse (SibSp = 1).
# - Most passengers had no parents or children aboard (Parch = 0), followed by those with one parent or child (Parch = 1).
# - The majority of passengers embarked from Southampton (Embarked = S), followed by Cherbourg (Embarked = C) and Queenstown (Embarked = Q).
Unique values in Survived: [0 1]
Unique values in Pclass: [3 1 2]
Unique values in Sex: ['male' 'female']
Unique values in SibSp: [1 0 3 4 2 5 8]
Unique values in Parch: [0 1 2 5 3 4 6]
Unique values in Embarked: ['S' 'C' 'Q' nan]
===============================================================================================================================================
Value counts for Survived:
Survived
0    549
1    342
Name: count, dtype: int64
...............................................................................................................................................
Survived
0    61.616162
1    38.383838
Name: proportion, dtype: float64
...............................................................................................................................................
[Figure: bar chart of value counts for Survived]
Value counts for Pclass:
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
...............................................................................................................................................
Pclass
3    55.106622
1    24.242424
2    20.650954
Name: proportion, dtype: float64
...............................................................................................................................................
[Figure: bar chart of value counts for Pclass]
Value counts for Sex:
Sex
male      577
female    314
Name: count, dtype: int64
...............................................................................................................................................
Sex
male      64.758698
female    35.241302
Name: proportion, dtype: float64
...............................................................................................................................................
[Figure: bar chart of value counts for Sex]
Value counts for SibSp:
SibSp
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: count, dtype: int64
...............................................................................................................................................
SibSp
0    68.237935
1    23.456790
2     3.142536
4     2.020202
3     1.795735
8     0.785634
5     0.561167
Name: proportion, dtype: float64
...............................................................................................................................................
[Figure: bar chart of value counts for SibSp]
Value counts for Parch:
Parch
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: count, dtype: int64
...............................................................................................................................................
Parch
0    76.094276
1    13.243547
2     8.978676
5     0.561167
3     0.561167
4     0.448934
6     0.112233
Name: proportion, dtype: float64
...............................................................................................................................................
[Figure: bar chart of value counts for Parch]
Value counts for Embarked:
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64
...............................................................................................................................................
Embarked
S    72.440945
C    18.897638
Q     8.661417
Name: proportion, dtype: float64
...............................................................................................................................................
[Figure: bar chart of value counts for Embarked]
In [25]:
# Find the missing values in each column in the dataset
df.isnull().sum().sort_values(ascending=False)
Out[25]:
Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64
In [26]:
len(df)
Out[26]:
891
In [27]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
Cabin          77.104377
Age            19.865320
Embarked        0.224467
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
dtype: float64
In [28]:
# Insights from the missing values percentage:
# - Age has 19.9% missing values, which is significant and may require imputation or removal.
# - Cabin has 77.1% missing values, which is very high and may not be useful for analysis.
# - Embarked has only 0.2% missing values, which is negligible and can be easily handled.
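The count and percentage views of missing data can be combined into a single table. The helper below is a sketch; `missing_summary` is a hypothetical name introduced here, and the sample values are partly made up to keep the example small.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, highest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Small illustrative frame (the missing entries are placed by hand)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(df)` on the Titanic DataFrame should list Cabin, Age, and Embarked, in line with the percentages printed above.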
In [29]:
# Drop the Cabin column as it has too many missing values
# df.drop(columns=['Cabin'], inplace=True)
# or
df.drop(['Cabin'], inplace=True, axis=1)
# PassengerId is only an identifier and is not useful for analysis; it is kept for now and can be dropped before modeling.
In [30]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
Age            19.865320
Embarked        0.224467
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
dtype: float64
In [31]:
# Embarked has only 0.2% missing values, which is negligible and can be easily handled.
# Let's understand the various ways in which missing values in a categorical column can be handled:
# 1. Drop the rows with missing values in the categorical column
# 2. Fill the missing values with the mode of the categorical column
# 3. Fill the missing values with a placeholder value (e.g., 'Unknown', 'Not Available')
# 4. Use forward fill or backward fill to fill the missing values based on the previous or next value in the column (illustration: https://miro.medium.com/v2/resize:fit:1348/0*BXk8ZdgFBXNw2yuZ)
# 5. Use machine learning algorithms to predict the missing values based on other features in the dataset (advanced technique)
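Options 1 to 3 from the list above can be compared side by side on a small column. The `port` Series is made up for illustration and only borrows the `Embarked` codes.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with two missing values
port = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], name="Embarked")

# Option 1: drop the rows with missing values
dropped = port.dropna()

# Option 2: fill with the mode (most frequent value); mode() returns a Series, so take [0]
mode_filled = port.fillna(port.mode()[0])

# Option 3: fill with an explicit placeholder category
placeholder_filled = port.fillna("Unknown")

print(dropped.tolist())
print(mode_filled.tolist())
print(placeholder_filled.tolist())
```

Dropping loses rows, mode filling inflates the most common category, and a placeholder keeps the fact that the value was missing visible; which trade-off is acceptable depends on how many values are missing.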
In [32]:
# forward fill or backward fill demonstration

# Creating the dummy student dataset
data = {
    'Student_ID': [101, 102, 103, 104, 105, 106],
    'Name': ['Alice', 'Bob', np.nan, 'David', np.nan, 'Frank'],
    'Grade': ['A', np.nan, 'B', np.nan, 'C', np.nan],
    'Marks': [95, np.nan, 88, np.nan, np.nan, 72]
}

demo = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
demo
Original DataFrame with Missing Values:
Out[32]:
Student_ID Name Grade Marks
0 101 Alice A 95.0
1 102 Bob NaN NaN
2 103 NaN B 88.0
3 104 David NaN NaN
4 105 NaN C NaN
5 106 Frank NaN 72.0
In [33]:
# Forward fill (ffill): fill missing values in the Name column with the previous non-missing value
demo['Name_ffill'] = demo['Name'].ffill()
# Backward fill (bfill): fill missing values in the Name column with the next non-missing value
demo['Name_bfill'] = demo['Name'].bfill()
demo
Out[33]:
Student_ID Name Grade Marks Name_ffill Name_bfill
0 101 Alice A 95.0 Alice Alice
1 102 Bob NaN NaN Bob Bob
2 103 NaN B 88.0 Bob David
3 104 David NaN NaN David David
4 105 NaN C NaN David Frank
5 106 Frank NaN 72.0 Frank Frank
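One caveat the Name column does not show: forward fill cannot fill a gap at the start of a column, and backward fill cannot fill a gap at the end. The sketch below uses the Grade values from the demo table above, where the last entry is missing.

```python
import numpy as np
import pandas as pd

# Grade column from the demo table above; the last value is missing
grade = pd.Series(["A", np.nan, "B", np.nan, "C", np.nan], name="Grade")

grade_ffill = grade.ffill()          # the trailing gap is filled from 'C'
grade_bfill = grade.bfill()          # the trailing gap stays NaN: nothing follows it
grade_both = grade.bfill().ffill()   # chaining both covers gaps at either end

print(grade_ffill.tolist())
print(grade_bfill.tolist())
print(grade_both.tolist())
```

Both fills assume that neighbouring rows are related (for example, ordered in time), which is why they are rarely a good fit for a column such as Embarked.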
In [34]:
df.Embarked.mode()  # Find the mode of the Embarked column
Out[34]:
0    S
Name: Embarked, dtype: object
In [35]:
type(df.Embarked.mode())
Out[35]:
pandas.core.series.Series
In [36]:
df.Embarked.mode()[0]  # Get the mode of the Embarked column
Out[36]:
'S'
In [37]:
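# Let's fill the missing values in the Embarked column with the mode of the column
# Assigning the result back avoids the chained in-place fillna, which recent pandas versions warn about
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])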
In [38]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
Age            19.86532
PassengerId     0.00000
Survived        0.00000
Pclass          0.00000
Name            0.00000
Sex             0.00000
SibSp           0.00000
Parch           0.00000
Ticket          0.00000
Fare            0.00000
Embarked        0.00000
dtype: float64
In [39]:
# Let's fill the missing values in the Age column
# Age is numerical, so we can fill the missing values with either the mean or the median of the column
# Rules of thumb:
# 1. If the data is approximately symmetric (close to normally distributed), use the mean to fill missing values.
# 2. If the data is skewed, use the median to fill missing values.

# Let's check the distribution of the Age column
sns.histplot(df['Age'], kde=True)
plt.axvline(df['Age'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Age'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Age with Mean and Median')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Let's calculate the skewness of the Age column
print("Skewness of the Age column:", df['Age'].skew())
# Rule of thumb for interpreting skewness:
# a) Between -0.5 and 0.5: the distribution is approximately symmetrical.
# b) Between -1 and -0.5 (negative skew) or between 0.5 and 1 (positive skew): the distribution is moderately skewed.
# c) Below -1 (negative skew) or above 1 (positive skew): the distribution is highly skewed.
[Figure: histogram of Age with KDE curve, mean line, and median line]
Skewness of the Age column: 0.38910778230082704
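The skewness thresholds and the mean-versus-median rule of thumb can be written down as two small helpers. This is a sketch: `classify_skewness` and `choose_fill_value` are hypothetical names introduced here, and the cut-offs are the rule-of-thumb values from the comments above, not a statistical test.

```python
import pandas as pd

def classify_skewness(skew: float) -> str:
    """Label a skewness value using the rule-of-thumb thresholds 0.5 and 1."""
    if abs(skew) <= 0.5:
        return "approximately symmetric"
    if abs(skew) <= 1:
        return "moderately skewed"
    return "highly skewed"

def choose_fill_value(values: pd.Series) -> float:
    """Mean for roughly symmetric data, median otherwise (missing values are ignored)."""
    if classify_skewness(values.skew()) == "approximately symmetric":
        return float(values.mean())
    return float(values.median())

symmetric = pd.Series([1, 2, 3, 4, 5])      # made-up, perfectly symmetric
right_skewed = pd.Series([1, 1, 1, 1, 10])  # made-up, one large outlier

print(classify_skewness(0.38910778230082704))  # the Age skewness printed above
print(choose_fill_value(symmetric))
print(choose_fill_value(right_skewed))
```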
In [40]:
# Credit: Shivamsuman
# Insights from the histogram:
# Looking at the left edge of the plot, there do not seem to be many passengers with an age close to 0.
In [41]:
# Since the skewness of the Age column is 0.389, which is between -0.5 and 0.5, the distribution is almost symmetrical.
# Therefore, we can use the mean to fill the missing values in the Age column
# Assigning the result back avoids the chained in-place fillna, which recent pandas versions warn about
df['Age'] = df['Age'].fillna(df['Age'].mean())
In [42]:
# Let's check the distribution of the Age column again, now that the missing values are filled
sns.histplot(df['Age'], kde=True)
plt.axvline(df['Age'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Age'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Age with Mean and Median')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Let's calculate the skewness of the Age column
print("Skewness of the Age column:", df['Age'].skew())
# Rule of thumb for interpreting skewness:
# a) Between -0.5 and 0.5: the distribution is approximately symmetrical.
# b) Between -1 and -0.5 (negative skew) or between 0.5 and 1 (positive skew): the distribution is moderately skewed.
# c) Below -1 (negative skew) or above 1 (positive skew): the distribution is highly skewed.
[Figure: histogram of Age after imputation, with KDE curve, mean line, and median line]
Skewness of the Age column: 0.4344880940129925
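The skewness moved from 0.389 to 0.434 even though only the mean was inserted. A likely explanation is that mean imputation stacks all filled values at one point: the mean stays the same, but the spread shrinks, and skewness is measured relative to that spread. The sketch below shows the effect on a made-up series; the `ages` values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages with three missing values
ages = pd.Series([2.0, 22.0, 26.0, 35.0, 38.0, 80.0, np.nan, np.nan, np.nan])

mean_filled = ages.fillna(ages.mean())

# mean() and std() skip NaN by default, so the "before" numbers use the six observed values
print("Mean before / after:", ages.mean(), mean_filled.mean())
print("Std  before / after:", ages.std(), mean_filled.std())
print("Values sitting exactly at the mean:", int((mean_filled == ages.mean()).sum()))
```

This is why a histogram after mean imputation shows a tall spike at the mean. With almost 20% of Age missing, that spike is worth keeping in mind when interpreting later plots.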
In [43]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
PassengerId    0.0
Survived       0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Embarked       0.0
dtype: float64
In [44]:
# Now, the dataset has no missing values
# Lets check the shape of the dataset again
print("Shape of the dataset after handling missing values:", df.shape)
Shape of the dataset after handling missing values: (891, 11)
In [45]:
# Let us check for duplicates in the dataset
print("Number of duplicate rows in the dataset:", df.duplicated().sum())
# Insights from the duplicate rows:
# There are no duplicate rows in the dataset, which is good as it ensures that each passenger is unique and there are no repeated entries.
Number of duplicate rows in the dataset: 0
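The Titanic data has no duplicate rows, so the check above has nothing to remove. For a dataset that does, a minimal sketch of detection and removal could look like the following; the `people` table deliberately repeats one row, and that repetition is made up for the example.

```python
import pandas as pd

# Three rows, where the third is an intentional copy of the first
people = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Braund, Mr. Owen Harris"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "A/5 21171"],
    "Fare": [7.25, 7.925, 7.25],
})

# duplicated() marks every repeat of an earlier row as True
full_duplicates = int(people.duplicated().sum())

# Restricting the check to key columns catches rows that differ only in other columns
key_duplicates = int(people.duplicated(subset=["Name", "Ticket"]).sum())

# Keep the first occurrence of each row and rebuild a clean index
deduplicated = people.drop_duplicates(keep="first").reset_index(drop=True)

print(full_duplicates, key_duplicates, len(deduplicated))
```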
In [46]:
# Now that the dataset is clean, we can proceed with the Feature Engineering step of the EDA process.
# What is Feature Engineering?
# Feature = column, Engineering = creating
# Feature Engineering is the process of using domain knowledge to create new, more meaningful features (variables) from the existing ones, or to modify existing features, so that machine learning algorithms work better and the dataset has more predictive power.
# Feature Engineering can include:
# 1. Creating new features from existing ones (e.g., extracting titles from names, creating age groups)
# 2. Encoding categorical variables (e.g., converting 'Sex' and 'Embarked' columns into numerical format)
# 3. Normalizing or scaling numerical features (e.g., scaling 'Fare' and 'Age' columns)
# 4. Handling outliers (e.g., removing or transforming extreme values)
# 5. Creating interaction features (e.g., combining 'Pclass' and 'Sex' to create a new feature)

# General examples of Feature Engineering:
# 1. Calculating the Body Mass Index (BMI) from weight and height features in a health dataset.
# 2. Extracting the year, month, and day from a date feature in a time series dataset.
# 3. If a dataset contains a timestamp, creating new features like hour, day of the week, or month to capture temporal patterns.
# 4. Given a Date of Birth column, creating an Age column by subtracting the Date of Birth from the current date.
# 5. Given columns such as basic salary, dearness allowance, and house rent allowance, creating a Gross Salary column by adding them together.

df.head()
Out[46]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S
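General examples 2, 4, and 5 above (date parts, age from date of birth, gross salary) can be sketched as follows. The `staff` table and all its values are hypothetical, and a fixed `reference_date` stands in for "the current date" so that the result is reproducible.

```python
import pandas as pd

# Hypothetical staff table (all values invented for illustration)
staff = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1990-06-15", "1985-12-01"]),
    "basic_salary": [50000, 65000],
    "dearness_allowance": [5000, 6500],
    "house_rent_allowance": [10000, 13000],
})

# Fixed reference date instead of today's date, so the output does not change over time
reference_date = pd.Timestamp("2024-01-01")

# Example 2: extract parts of a date
dob = staff["date_of_birth"]
staff["birth_year"] = dob.dt.year
staff["birth_month"] = dob.dt.month

# Example 4: age in completed years (subtract 1 if the birthday has not yet occurred)
birthday_pending = (dob.dt.month > reference_date.month) | (
    (dob.dt.month == reference_date.month) & (dob.dt.day > reference_date.day)
)
staff["age"] = reference_date.year - dob.dt.year - birthday_pending.astype(int)

# Example 5: gross salary as the sum of its components
staff["gross_salary"] = (
    staff["basic_salary"] + staff["dearness_allowance"] + staff["house_rent_allowance"]
)

print(staff[["birth_year", "birth_month", "age", "gross_salary"]])
```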
In [47]:
# Let's engineer some features in the Titanic dataset
# 1. FamilySize =  SibSp + Parch + 1 (adding 1 to include the passenger themselves)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df.head()
Out[47]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1
In [48]:
# 2. isAlone = 1 if FamilySize == 1 else 0
# df['isAlone'] = np.where(df['FamilySize'] == 1, 1, 0)
# or
df['isAlone'] = df['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
df.head()
Out[48]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1
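The cell above shows two ways to build `isAlone` (`np.where` and `apply`). As a minimal sketch on made-up rows, the comparison below checks that both give the same result and adds a third option, casting the boolean comparison to an integer. The `toy` DataFrame is an assumption for illustration, not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the SibSp and Parch columns
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Row-by-row Python function (the approach used in the notebook)
via_apply = toy["FamilySize"].apply(lambda x: 1 if x == 1 else 0)

# Vectorized alternative with NumPy (the commented-out approach in the notebook)
via_where = pd.Series(np.where(toy["FamilySize"] == 1, 1, 0), index=toy.index)

# Vectorized alternative: cast the boolean mask to integers
via_astype = (toy["FamilySize"] == 1).astype(int)

print(pd.DataFrame({"apply": via_apply, "where": via_where, "astype": via_astype}))
```

All three produce the same column. The vectorized forms avoid calling a Python function once per row, which usually matters only for large tables.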
In [49]:
df["isAlone"].value_counts().plot(kind='bar', title='isAlone')
plt.xticks(rotation=0)
plt.xlabel('isAlone')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df["isAlone"].value_counts()):
    plt.text(index, value, str(value), ha='center', va='bottom')
plt.show()

# Insights from the isAlone feature:
# - Most passengers were alone (isAlone = 1), with a count of 537.
# - A smaller proportion of passengers traveled with family (isAlone = 0), with a count of 354.
In [50]:
df.head()
Out[50]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1
In [51]:
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(","))
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1])
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1].split("."))
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1].split(".")[0])
print("=" * 143)
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1].split(".")[0].strip())
print("Braund, Mr. Owen Harris".split(",")[1].split(".")[0].strip())
print("Heikkinen, Miss. Laina".split(",")[1].split(".")[0].strip())
print("Allen, Mr. William Henry".split(",")[1].split(".")[0].strip())
['Futrelle', ' Mrs. Jacques Heath (Lily May Peel)']
 Mrs. Jacques Heath (Lily May Peel)
[' Mrs', ' Jacques Heath (Lily May Peel)']
 Mrs
===============================================================================================================================================
Mrs
Mr
Miss
Mr
In [52]:
# 3. Title extraction from Name column
# Way1 - User-defined function to extract title from Name column
# def extract_title(name):
#     # Split the name by comma and then by dot to get the title
#     return name.split(",")[1].split(".")[0].strip()
# # Apply the function to the Name column to create a new Title column
# df['Title'] = df['Name'].apply(extract_title)
# or
# Way2 - lambda function
df['Title'] = df['Name'].apply(lambda x: x.split(",")[1].split(".")[0].strip())
df.head()
Out[52]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
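The split-based title extraction above can also be written with a regular expression through `Series.str.extract`. The sketch below is one possible alternative, not the notebook's method; the pattern and the short list of names are chosen for illustration.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" layout used by the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Split-based extraction, as in the notebook
by_split = names.apply(lambda x: x.split(",")[1].split(".")[0].strip())

# Regex-based extraction: the text after the first comma, up to the next period
by_regex = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(pd.DataFrame({"split": by_split, "regex": by_regex}))
```

One practical difference: `str.extract` returns a missing value for a name that does not match the pattern, whereas the split-based version raises an `IndexError` when a name has no comma.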
In [53]:
df.head()
Out[53]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [54]:
# Let's analyze the Fare column
df["Fare"].describe()
# Insights from the Fare column:
# - The average fare paid by passengers is approximately 32.20 British pounds.
# - The minimum fare is 0.00 British pounds, which indicates that some passengers may have traveled for free or at a very low cost.
# - The maximum fare is 512.33 British pounds, which indicates that some passengers paid significantly higher fares.
# - The standard deviation of the fare is approximately 49.69, indicating a wide range of fares paid by passengers.
# Visualizing the distribution of the Fare column
sns.histplot(df['Fare'], kde=True)
plt.axvline(df['Fare'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Fare'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.axvline(df['Fare'].quantile(0.75), color='blue', linestyle='dotted', linewidth=1, label='75th Percentile')
plt.axvline(df['Fare'].quantile(0.85), color='purple', linestyle='dotted', linewidth=1, label='85th Percentile')
plt.axvline(df['Fare'].quantile(0.90), color='cyan', linestyle='dotted', linewidth=1, label='90th Percentile')
plt.axvline(df['Fare'].quantile(0.95), color='orange', linestyle='dotted', linewidth=1, label='95th Percentile')
plt.axvline(df['Fare'].quantile(0.99), color='brown', linestyle='dotted', linewidth=1, label='99th Percentile')
plt.axvline(df['Fare'].quantile(1.00), color='black', linestyle='dotted', linewidth=1, label='100th Percentile (Max)')
plt.title('Distribution of Fare with Mean and Median')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# Insights from the Fare distribution:
# - The distribution of the Fare column is right-skewed, with a long tail towards higher fares.
# - The mean fare is higher than the median fare, indicating that there are some passengers who paid significantly higher fares.
# - The majority of passengers paid fares below 100 British pounds, with a few passengers paying much higher fares.
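When several percentile markers are drawn by hand, a label can easily drift away from the quantile it describes. A minimal sketch of one way to avoid that is to build each label from the percentile itself. The fares below are made up for illustration and are not the Titanic values.

```python
import pandas as pd

# Made-up right-skewed fares, for illustration only
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 53.1, 71.3, 263.0, 512.3])

# Build each label from the percentile it describes so the two cannot disagree
percentiles = [75, 90, 95, 99]
markers = {f"{p}th Percentile": fare.quantile(p / 100) for p in percentiles}

for label, value in markers.items():
    print(f"{label}: {value:.2f}")

# For a right-skewed column the mean sits above the median and the skewness is positive
print("mean:", round(fare.mean(), 2))
print("median:", fare.median())
print("skewness:", round(fare.skew(), 2))
```

In a plot, the same dictionary could be looped over to call `plt.axvline(value, label=label)` once per marker.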
In [55]:
# Let's find the passengers who paid zero fare
zero_fare_passengers = df[df['Fare'] == 0]
zero_fare_passengers
# Insights from the zero fare passengers:
# - There are 15 passengers who paid a fare of 0.00 British pounds.
# A plausible explanation is that they were crew members or company employees travelling free of charge, because:
# 1. All are recorded as adults (Age > 18); note that the repeated value 29.699118 looks like an imputed mean rather than an observed age
# 2. All are males
# 3. All embarked from Southampton (Embarked = S)
# 4. All have isAlone = 1 (indicating they traveled alone)
# The dataset alone cannot confirm this explanation.
Out[55]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
179 180 0 3 Leonard, Mr. Lionel male 36.000000 0 0 LINE 0.0 S 1 1 Mr
263 264 0 1 Harrison, Mr. William male 40.000000 0 0 112059 0.0 S 1 1 Mr
271 272 1 3 Tornquist, Mr. William Henry male 25.000000 0 0 LINE 0.0 S 1 1 Mr
277 278 0 2 Parkes, Mr. Francis "Frank" male 29.699118 0 0 239853 0.0 S 1 1 Mr
302 303 0 3 Johnson, Mr. William Cahoone Jr male 19.000000 0 0 LINE 0.0 S 1 1 Mr
413 414 0 2 Cunningham, Mr. Alfred Fleming male 29.699118 0 0 239853 0.0 S 1 1 Mr
466 467 0 2 Campbell, Mr. William male 29.699118 0 0 239853 0.0 S 1 1 Mr
481 482 0 2 Frost, Mr. Anthony Wood "Archie" male 29.699118 0 0 239854 0.0 S 1 1 Mr
597 598 0 3 Johnson, Mr. Alfred male 49.000000 0 0 LINE 0.0 S 1 1 Mr
633 634 0 1 Parr, Mr. William Henry Marsh male 29.699118 0 0 112052 0.0 S 1 1 Mr
674 675 0 2 Watson, Mr. Ennis Hastings male 29.699118 0 0 239856 0.0 S 1 1 Mr
732 733 0 2 Knight, Mr. Robert J male 29.699118 0 0 239855 0.0 S 1 1 Mr
806 807 0 1 Andrews, Mr. Thomas Jr male 39.000000 0 0 112050 0.0 S 1 1 Mr
815 816 0 1 Fry, Mr. Richard male 29.699118 0 0 112058 0.0 S 1 1 Mr
822 823 0 1 Reuchlin, Jonkheer. John George male 38.000000 0 0 19972 0.0 S 1 1 Jonkheer
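Statements such as "all are males" or "all embarked from Southampton" are easy to verify in code instead of by reading the table. The sketch below runs such checks on a made-up stand-in for `df`; the `toy` rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as the notebook's DataFrame
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "male"],
    "Age": [36.0, 40.0, 28.0, 19.0],
    "Embarked": ["S", "S", "C", "S"],
    "isAlone": [1, 1, 0, 1],
    "Fare": [0.0, 0.0, 71.3, 0.0],
})

zero_fare = toy[toy["Fare"] == 0]

# Each entry is True only if the condition holds for every zero-fare row
checks = {
    "all adults (Age > 18)": bool((zero_fare["Age"] > 18).all()),
    "all male": bool((zero_fare["Sex"] == "male").all()),
    "all embarked at S": bool((zero_fare["Embarked"] == "S").all()),
    "all travelling alone": bool((zero_fare["isAlone"] == 1).all()),
}

for claim, holds in checks.items():
    print(f"{claim}: {holds}")
```

Replacing `toy` with the notebook's `df` would run the same checks on the real zero-fare rows.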
In [56]:
# Extract row where Title is "Capt"
captain_passengers = df[df['Title'] == 'Capt']
captain_passengers
# Insights from the captain passengers:
# - There is only one passenger with the title "Capt" (Captain).
# - He is a 70-year-old male who did not survive.
# - He was probably traveling with his family, as he has a family size of 3.
# - His fare is not zero (71.0), so he was a fare-paying first-class passenger.
# Caution: the title alone does not tell us what he was captain of (another ship, the army, etc.).
# He should not be assumed to be the captain of the Titanic: the ship was commanded by Captain Edward John Smith,
# whereas this row belongs to Edward Gifford Crosby, a paying passenger. Confirming his background needs data beyond this dataset.
Out[56]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
745 746 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0 S 3 0 Capt
In [57]:
# Is my Title column perfect?
# Let's check the unique values in the Title column
print("Unique values in Title column:", df['Title'].unique())
print("Value counts for Title column:")
print(df['Title'].value_counts())
# Insights from the Title column:
# - The Title column contains 17 distinct titles: 'Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms', 'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess', and 'Jonkheer'.
# - Four titles ('Mr', 'Miss', 'Mrs', 'Master') cover most passengers; each of the others occurs at most 7 times.
# - Some rare titles can be folded into common ones, e.g. 'Mlle' (French for Miss), 'Mme' (French for Mrs), and 'Don' (a Spanish male honorific).
# Let's group the titles into broader categories
title_mapping = {
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Master': 'Master',
    'Don': 'Mr',
    'Rev': 'Mr',
    'Dr': 'Mr',
    'Mme': 'Mrs',
    'Ms': 'Miss',
    'Major': 'Mr',
    'Lady': 'Mrs',
    'Sir': 'Mr',
    'Mlle': 'Miss',
    'Col': 'Mr',
    'Capt': 'Mr',
    'the Countess': 'Mrs',
    'Jonkheer': 'Mr'
}
# Apply the mapping to the Title column
df['Title'] = df['Title'].map(title_mapping)
# Check the unique values in the Title column after mapping
print("Unique values in Title column after mapping:", df['Title'].unique())
# Check the value counts for the Title column after mapping
print("Value counts for Title column after mapping:")
print(df['Title'].value_counts())
# Plotting the value counts for the Title column after mapping
df['Title'].value_counts().plot(kind='bar', title='Title')
plt.xticks(rotation=0)
plt.xlabel('Title')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df['Title'].value_counts()):
    plt.text(index, value + 5, str(value), ha='center', va='bottom')
plt.show()
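# Insights from the Title column after mapping:
# - Only four titles remain: 'Mr' (538), 'Miss' (185), 'Mrs' (128), and 'Master' (40).
# - No title was left unmapped: the four counts still add up to 891.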
Unique values in Title column: ['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
 'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer']
Value counts for Title column:
Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: count, dtype: int64
Unique values in Title column after mapping: ['Mr' 'Mrs' 'Miss' 'Master']
Value counts for Title column after mapping:
Title
Mr        538
Miss      185
Mrs       128
Master     40
Name: count, dtype: int64

Logical Grouping and Justification:

| Original Title | Mapped To | Justification |
| --- | --- | --- |
| Mr | Mr | Already common |
| Mrs | Mrs | Already common |
| Miss | Miss | Already common |
| Master | Master | Already common |
| Don, Sir, Jonkheer | Mr | Noblemen or male honorifics |
| Dr, Rev, Col, Capt, Major | Mr | Male professionals or officers |
| Mme, the Countess, Lady | Mrs | Married women or titled nobility |
| Ms, Mlle | Miss | Variants or equivalents of Miss |
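One detail worth keeping in mind with a dictionary mapping: `Series.map` returns a missing value for any title that has no key in the dictionary. The sketch below shows how to detect such titles. The shortened mapping, the sample titles, and the fallback choice are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

# Shortened mapping, for the example only
title_mapping = {"Mr": "Mr", "Mrs": "Mrs", "Don": "Mr", "Mlle": "Miss"}

# 'Dona' is deliberately missing from the mapping above
titles = pd.Series(["Mr", "Don", "Mlle", "Dona"])

mapped = titles.map(title_mapping)

# Titles that the mapping did not cover show up as missing values
unmapped = titles[mapped.isna()].unique().tolist()
print("Unmapped titles:", unmapped)

# One possible fallback (an assumption, not the notebook's choice): keep the original title
mapped_with_fallback = mapped.fillna(titles)
print(mapped_with_fallback.tolist())
```

Running the `unmapped` check before overwriting the `Title` column makes it harder to lose rows silently when new data contains a title that was not anticipated.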

In [58]:
df.Title.value_counts(normalize=True) * 100
Out[58]:
Title
Mr        60.381594
Miss      20.763187
Mrs       14.365881
Master     4.489338
Name: proportion, dtype: float64
In [59]:
df.head()
Out[59]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [60]:
# Let's find the correlation between the numerical columns in the dataset
# Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.
# It ranges from -1 to 1, where:
# -1 indicates a perfect negative linear correlation (as one variable increases, the other decreases)
# 0 indicates no linear correlation (a non-linear relationship may still exist)
# 1 indicates a perfect positive linear correlation (as one variable increases, the other also increases)
# The Pearson correlation coefficient is the most common measure of correlation and is the default used by Pandas.
# In Pandas, the corr() method calculates the pairwise correlation between the columns of a DataFrame.
print("Correlation between numerical columns in the dataset:")
# numeric_only=True restricts the calculation to numeric columns; without it, recent Pandas versions raise an error on text columns such as 'Name'
cor = df.corr(numeric_only=True)
# Display the correlation matrix
cor
Correlation between numerical columns in the dataset:
Out[60]:
PassengerId Survived Pclass Age SibSp Parch Fare FamilySize isAlone
PassengerId 1.000000 -0.005007 -0.035144 0.033207 -0.057527 -0.001652 0.012658 -0.040143 0.057462
Survived -0.005007 1.000000 -0.338481 -0.069809 -0.035322 0.081629 0.257307 0.016639 -0.203367
Pclass -0.035144 -0.338481 1.000000 -0.331339 0.083081 0.018443 -0.549500 0.065997 0.135207
Age 0.033207 -0.069809 -0.331339 1.000000 -0.232625 -0.179191 0.091566 -0.248512 0.179775
SibSp -0.057527 -0.035322 0.083081 -0.232625 1.000000 0.414838 0.159651 0.890712 -0.584471
Parch -0.001652 0.081629 0.018443 -0.179191 0.414838 1.000000 0.216225 0.783111 -0.583398
Fare 0.012658 0.257307 -0.549500 0.091566 0.159651 0.216225 1.000000 0.217138 -0.271832
FamilySize -0.040143 0.016639 0.065997 -0.248512 0.890712 0.783111 0.217138 1.000000 -0.690922
isAlone 0.057462 -0.203367 0.135207 0.179775 -0.584471 -0.583398 -0.271832 -0.690922 1.000000
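To make the Pearson coefficient concrete, the sketch below computes it by hand as the sum of products of the centered values divided by the square root of the product of their sums of squares, and compares the result with `Series.corr`. The eight rows are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up rows: lower class numbers go with higher fares
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Fare": [80.0, 60.0, 26.0, 13.0, 8.0, 7.9, 7.2, 15.5],
})

x = toy["Pclass"].to_numpy(dtype=float)
y = toy["Fare"].to_numpy(dtype=float)

# Center both variables on their means
x_centered = x - x.mean()
y_centered = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same quantity from Pandas (Pearson is the default method)
r_pandas = toy["Pclass"].corr(toy["Fare"])

print("manual:", round(r_manual, 6))
print("pandas:", round(r_pandas, 6))
```

The sign is negative here because the class number increases while the fare decreases, which is the same reading that applies to the Pclass and Fare entry in the matrix above.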
In [61]:
# Plotting the correlation matrix using seaborn heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(cor, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Columns')
plt.show()
In [62]:
# Insights from the correlation matrix:
# - 'Survived' has a moderate negative correlation with 'Pclass' (-0.34). Because 1 denotes 1st class, a lower class number goes with higher survival, i.e. 1st-class passengers had a higher survival rate.
# - 'Survived' has a weak positive correlation with 'Fare' (0.26), indicating that passengers who paid higher fares tended to survive more often.
# - 'Age' has a very weak positive correlation with 'Fare' (0.09), indicating that older passengers tended to pay slightly higher fares.
# - 'SibSp' and 'Parch' have a moderate positive correlation (0.41), indicating that passengers with more siblings/spouses tended to have more parents/children aboard.
# - 'FamilySize' is strongly correlated with 'SibSp' (0.89) and 'Parch' (0.78). This is expected, because FamilySize is computed from these two columns.
# - 'FamilySize' has almost no linear correlation with 'Survived' (0.02).
# - 'isAlone' has a weak negative correlation with 'Survived' (-0.20), indicating that passengers who were alone had a lower survival rate compared to those who traveled with family.
# - The 'Title' column is categorical, so it is not part of the numerical correlation matrix, but it can be analyzed separately.
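Reading values off a heatmap by eye invites transcription mistakes, so it can help to rank the correlations with the target in code. The sketch below copies the `Survived` row from the matrix printed in Out[60] into a Series and sorts it by absolute value. With the notebook's `cor` object, `cor['Survived'].drop('Survived')` would presumably give the same Series.

```python
import pandas as pd

# Survived row copied from the correlation matrix printed in Out[60]
survived_corr = pd.Series({
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Age": -0.069809,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
    "FamilySize": 0.016639,
    "isAlone": -0.203367,
})

# Order the features by the strength of the correlation, keeping the sign
order = survived_corr.abs().sort_values(ascending=False).index
ranked = survived_corr.reindex(order)

print(ranked.round(2))
```

Sorting by absolute value keeps strong negative relationships, such as the one with `Pclass`, at the top instead of at the bottom of the list.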
In [63]:
df.head()
Out[63]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [64]:
# Let's understand the Survived column in detail
df.Survived.value_counts().plot(kind='bar', title='Survived')
plt.xticks(rotation=0)
plt.xlabel('Survived')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df.Survived.value_counts()):
    plt.text(index, value + 5, str(value), ha='center', va='bottom')
plt.show()
# Insights from the Survived column:
# - The majority of passengers did not survive (0 = No, 1 = Yes), with a survival rate of approximately 38.4%.
In [65]:
df.head()
Out[65]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [66]:
df.groupby('Pclass')['Survived'].mean()
Out[66]:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
In [67]:
df.groupby('Pclass')[['Survived']].mean()
Out[67]:
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
In [68]:
df.groupby('Pclass')[['Survived']].mean().index
Out[68]:
Index([1, 2, 3], dtype='int64', name='Pclass')
In [69]:
df.groupby('Pclass')[['Survived']].mean().values
Out[69]:
array([[0.62962963],
       [0.47282609],
       [0.24236253]])
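The four cells above rely on two facts: the mean of a 0/1 column is the share of ones, and single versus double brackets after `groupby` decide whether a Series or a DataFrame comes back. A minimal sketch on made-up rows (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Made-up rows with the same column names as the notebook's DataFrame
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("Pclass")["Survived"]

# Because Survived is 0 or 1, its mean equals survivors divided by passengers
rate_via_mean = grouped.mean()
rate_via_counts = grouped.sum() / grouped.count()

# Single brackets return a Series, double brackets return a one-column DataFrame
as_series = toy.groupby("Pclass")["Survived"].mean()
as_frame = toy.groupby("Pclass")[["Survived"]].mean()

print(rate_via_mean)
print(type(as_series).__name__, type(as_frame).__name__)
```

The DataFrame form is convenient for plotting with a legend entry, while the Series form is easier to iterate over when placing text on the bars.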
In [70]:
# How is Pclass related to Survived?
print(df.groupby('Pclass')['Survived'].mean())
# Insights:
# - The survival rate is highest for passengers in 1st class (Pclass = 1) at approximately 62.96%.
# - The survival rate is lower for passengers in 2nd class (Pclass = 2) at approximately 47.28%.
# - The survival rate is lowest for passengers in 3rd class (Pclass = 3) at approximately 24.24%.
# This indicates that passengers in higher classes had a significantly higher chance of survival compared to those in lower classes.

df.groupby('Pclass')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Passenger Class (Pclass)')
plt.xlabel('Passenger Class (Pclass)')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
#  Plot text on top of the bars
for index, value in enumerate(df.groupby('Pclass')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
In [71]:
# Visualizing the relationship between Pclass and Survived
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title('Survival Count by Passenger Class (Pclass)')
plt.xlabel('Passenger Class (Pclass)')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
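The counts shown on a count plot can also be obtained as a table with `pd.crosstab`, which is handy when the exact numbers are needed. The sketch below uses made-up rows; the `toy` DataFrame is an assumption for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as the notebook's DataFrame
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Raw counts: one row per class, one column per outcome (0 = No, 1 = Yes)
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# Row-wise shares: each row sums to 1, so column 1 is the survival rate per class
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(2))
```

With `normalize="index"`, the column for `Survived = 1` matches the output of `groupby('Pclass')['Survived'].mean()`.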
In [72]:
df.head()
Out[72]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [73]:
# How is Sex related to Survived?
print(df.groupby('Sex')['Survived'].mean())
# Insights:
# - The survival rate is higher for females (Sex = female) at approximately 74.2%.
# - The survival rate is lower for males (Sex = male) at approximately 18.9%.
# This indicates that females had a significantly higher chance of survival compared to males.

df.groupby('Sex')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
#  Plot text on top of the bars
for index, value in enumerate(df.groupby('Sex')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
In [74]:
# Visualizing the relationship between Sex and Survived
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title('Survival Count by Sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [75]:
df.head()
Out[75]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [76]:
# How is Embarked related to Survived?
print(df.groupby('Embarked')['Survived'].mean())
# Insights:
# - The survival rate is highest for passengers who embarked at Cherbourg (Embarked = C) at approximately 55.4%.
# - The survival rate is lower for passengers who embarked at Queenstown (Embarked = Q) at approximately 39.0%.
# - The survival rate is lowest for passengers who embarked at Southampton (Embarked = S) at approximately 33.9%.
# This indicates that the survival rate differs by place of embarkation, although the port may be standing in for other factors such as passenger class.

df.groupby('Embarked')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Embarked')
plt.xlabel('Embarked')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
#  Plot text on top of the bars
for index, value in enumerate(df.groupby('Embarked')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64
In [77]:
# Visualizing the relationship between Embarked and Survived
sns.countplot(x='Embarked', hue='Survived', data=df)
plt.title('Survival Count by Embarked')
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
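A survival rate is easier to interpret next to the number of passengers it is based on. The sketch below reports both with named aggregation; the rows and the output column names `survival_rate` and `passengers` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as the notebook's DataFrame
toy = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S", "S", "S", "S"],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Rate and group size side by side, highest rate first
summary = (
    toy.groupby("Embarked")["Survived"]
    .agg(survival_rate="mean", passengers="count")
    .sort_values("survival_rate", ascending=False)
)

print(summary)
```

In this toy data the top rate comes from a single passenger, which illustrates why a rate from a small group deserves less weight than one from a large group.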
In [78]:
df.head()
Out[78]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [79]:
# How is isAlone related to Survived?
print(df.groupby('isAlone')['Survived'].mean())
# Insights:
# - The survival rate is higher for passengers who were not alone (isAlone = 0) at approximately 50.6%.
# - The survival rate is lower for passengers who were alone (isAlone = 1) at approximately 30.4%.
# This indicates that passengers who traveled with family had a higher chance of survival compared to those who were alone.

df.groupby('isAlone')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by isAlone')
plt.xlabel('isAlone')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
#  Plot text on top of the bars
for index, value in enumerate(df.groupby('isAlone')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
isAlone
0    0.505650
1    0.303538
Name: Survived, dtype: float64
In [80]:
# Visualizing the relationship between isAlone and Survived
sns.countplot(x='isAlone', hue='Survived', data=df)
plt.title('Survival Count by isAlone')
plt.xlabel('isAlone')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [81]:
df.head()
Out[81]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [82]:
# How is Title related to Survived?
print(df.groupby('Title')['Survived'].mean())
# Insights:
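# - The survival rate is highest for Mrs (Title = Mrs) at approximately 79.7%, followed by Miss at approximately 70.3%.
# - Master (a title used for young boys) has a survival rate of approximately 57.5%.
# - The survival rate is lowest for Mr (Title = Mr) at approximately 16.2%.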

df.groupby('Title')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Title')
plt.xlabel('Title')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
#  Plot text on top of the bars
for index, value in enumerate(df.groupby('Title')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Title
Master    0.575000
Miss      0.702703
Mr        0.161710
Mrs       0.796875
Name: Survived, dtype: float64
In [83]:
# Visualizing the relationship between Title and Survived
sns.countplot(x='Title', hue='Survived', data=df)
plt.title('Survival Count by Title')
plt.xlabel('Title')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [84]:
df.head()
Out[84]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr


In [85]:
# Feature Engineering:
# - Create a new feature "GenderClass":
# If Age < 15, GenderClass = "child"
# Else GenderClass = "male" or "female" as per "Sex" column
def create_gender_class(row):
    # 'row' is a single passenger (one row of df), passed in by df.apply(..., axis=1)
    if row['Age'] < 15:
        return "child"
    else:
        return row['Sex']

df['GenderClass'] = df.apply(create_gender_class, axis=1)
df["GenderClass"].value_counts().plot(kind='bar', title='GenderClass')
plt.xticks(rotation=0)
plt.xlabel('GenderClass')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df["GenderClass"].value_counts()):
    plt.text(index, value, f"{value}", ha='center', va='bottom')
plt.show()
In [ ]:
df["GenderClass"].value_counts()
# Insights from the GenderClass feature:
# - There were 78 children (Age < 15) aboard the Titanic.
Out[ ]:
GenderClass
male      538
female    275
child      78
Name: count, dtype: int64
In [87]:
df.head()
Out[87]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male
In [ ]:
# How is GenderClass related to Survived?
print(df.groupby('GenderClass')['Survived'].mean())
# Insights:
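# - Females aged 15 or older had the highest survival rate at approximately 76.0%.
# - Children (Age < 15) had a survival rate of approximately 57.7%.
# - Males aged 15 or older had the lowest survival rate at approximately 16.4%.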

df.groupby('GenderClass')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by GenderClass')
plt.xlabel('GenderClass')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
#  Plot text on top of the bars
for index, value in enumerate(df.groupby('GenderClass')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
GenderClass
child     0.576923
female    0.760000
male      0.163569
Name: Survived, dtype: float64
In [89]:
# Visualizing the relationship between GenderClass and Survived
sns.countplot(x='GenderClass', hue='Survived', data=df)
plt.title('Survival Count by GenderClass')
plt.xlabel('GenderClass')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [90]:
df.head()
Out[90]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male
In [91]:
# Drop "GenderClass" feature
df = df.drop(columns=['GenderClass'])
# or
# df.drop(['GenderClass'], inplace=True, axis=1)
df.head()
Out[91]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr
In [93]:
# Recreate the "GenderClass" column with a row-wise lambda:
# passengers under 15 are labelled 'child'; everyone else keeps their Sex
df['GenderClass'] = df.apply(lambda row: 'child' if row['Age'] < 15 else row['Sex'], axis=1)
df["GenderClass"].value_counts()
Out[93]:
GenderClass
male      538
female    275
child      78
Name: count, dtype: int64
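A row-wise `apply` calls a Python function once per row. The same rule can be written as a single column-level comparison with `np.where`, which is usually faster on larger data. This is a minimal sketch on a made-up `toy` DataFrame, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic DataFrame (illustrative values only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 2.0, 14.0],
})

# Row-wise version: the lambda runs once per row
row_wise = toy.apply(lambda row: "child" if row["Age"] < 15 else row["Sex"], axis=1)

# Vectorized version: one comparison over the whole Age column
vectorized = pd.Series(np.where(toy["Age"] < 15, "child", toy["Sex"]), index=toy.index)

print(vectorized.tolist())
```

Both versions treat a missing `Age` the same way: the comparison is False, so the passenger keeps their `Sex` label.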
In [94]:
df.head()
Out[94]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male
In [95]:
# Binning Age into categories
# Logic:
# 1. If Age < 15, AgeGroup = "child"
# 2. If Age >= 15 and Age < 30, AgeGroup = "young"
# 3. If Age >= 30 and Age < 50, AgeGroup = "middle-aged"
# 4. If Age >= 50, AgeGroup = "senior"
def create_age_group(row):
    # Note: a missing Age fails every comparison and would fall through to 'senior'
    if row['Age'] < 15:
        return 'child'
    elif 15 <= row['Age'] < 30:
        return 'young'
    elif 30 <= row['Age'] < 50:
        return 'middle-aged'
    else:
        return 'senior'
# axis=1 passes one row at a time to the function
df['AgeGroup'] = df.apply(create_age_group, axis=1)
df.head()
Out[95]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass AgeGroup
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male young
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female middle-aged
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female young
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female middle-aged
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male middle-aged
In [ ]:
df["AgeGroup"].value_counts()
# Insights from the AgeGroup feature:
# Most passengers were young (483), followed by middle-aged (256), children (78), and seniors (74).
Out[ ]:
AgeGroup
young          483
middle-aged    256
child           78
senior          74
Name: count, dtype: int64
In [97]:
# Drop "AgeGroup" feature
df = df.drop(columns=['AgeGroup'])
# or
# df.drop(['AgeGroup'], inplace=True, axis=1)
df.head()
Out[97]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male
In [ ]:
# Binning Age into categories using a different approach - pd.cut()
# Logic:
# 1. If Age < 15, AgeGroup = "child"
# 2. If Age >= 15 and Age < 30, AgeGroup = "young"
# 3. If Age >= 30 and Age < 50, AgeGroup = "middle-aged"
# 4. If Age >= 50, AgeGroup = "senior"
bins = [0, 15, 30, 50, 100]
labels = ['child', 'young', 'middle-aged', 'senior']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
# right=False makes each bin include its left edge and exclude its right edge:
# [0, 15), [15, 30), [30, 50), [50, 100)
df.head()
Out[ ]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass AgeGroup
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male young
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female middle-aged
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female young
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female middle-aged
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male middle-aged


In [99]:
df["AgeGroup"].value_counts()
Out[99]:
AgeGroup
young          483
middle-aged    256
child           78
senior          74
Name: count, dtype: int64
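The two binning approaches give identical counts here, but they are not equivalent in every case. The sketch below compares them on a few boundary ages and one missing value. The `ages` Series is made up for illustration, and `create_age_group` is rewritten to take a single age rather than a row.

```python
import numpy as np
import pandas as pd

def create_age_group(age):
    """Same thresholds as the if/elif version, applied to a single age."""
    if age < 15:
        return "child"
    elif age < 30:
        return "young"
    elif age < 50:
        return "middle-aged"
    return "senior"

# Boundary values plus one missing age (illustrative data)
ages = pd.Series([14.9, 15.0, 29.9, 30.0, 50.0, 80.0, np.nan])
labels = ["child", "young", "middle-aged", "senior"]

by_function = ages.apply(create_age_group)
by_cut = pd.cut(ages, bins=[0, 15, 30, 50, 100], labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "function": by_function, "pd.cut": by_cut}))
```

The boundary ages agree, but the missing age is labelled `senior` by the function and left as `NaN` by `pd.cut`. With `right=False`, an age of exactly 100 would also fall outside the last bin.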
In [100]:
df.head()
Out[100]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass AgeGroup
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male young
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female middle-aged
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female young
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female middle-aged
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male middle-aged
In [104]:
# Ramani
# Question: among the women who survived, did their children survive as well?

# Keep every passenger whose Title is not "Mr" (i.e. Miss, Mrs, and Master)
except_Mr_passengers = df[df['Title'] != 'Mr']
except_Mr_passengers.head()
Out[104]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass AgeGroup
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female middle-aged
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female young
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female middle-aged
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 S 5 0 Master child child
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 S 3 0 Mrs female young
In [105]:
except_Mr_passengers.shape # (number of non-"Mr" passengers, number of columns)
Out[105]:
(353, 16)
In [106]:
except_Mr_passengers.Title.value_counts()
Out[106]:
Title
Miss      185
Mrs       128
Master     40
Name: count, dtype: int64
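Filtering by Title alone does not link a mother to her children. One rough way to approximate that link is sketched below, under two loud assumptions that are not in the original notebook: passengers sharing a `Ticket` travel as one family, and a passenger titled `Mrs` on that ticket is the mother of any child (Age < 15) on it. The `toy` data is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the Titanic DataFrame (illustrative values only)
toy = pd.DataFrame({
    "Ticket":   ["A1", "A1", "B2", "B2", "B2", "C3"],
    "Title":    ["Mrs", "Master", "Mrs", "Miss", "Master", "Mrs"],
    "Age":      [30, 4, 40, 8, 2, 50],
    "Survived": [1, 1, 0, 0, 1, 1],
})

# ASSUMPTION: a "Mrs" on a ticket is the mother of the children on that ticket
mothers = (
    toy[toy["Title"] == "Mrs"]
    .groupby("Ticket")["Survived"].max()
    .rename("MotherSurvived")
)

# Attach the mother's outcome to each child; drop children with no "Mrs" on their ticket
children = toy[toy["Age"] < 15].join(mothers, on="Ticket")
children = children.dropna(subset=["MotherSurvived"])

# Child survival rate, split by whether the presumed mother survived
result = children.groupby("MotherSurvived")["Survived"].mean()
print(result)
```

Tickets can be shared by non-relatives, so any result from this approach is a proxy and should be read with caution.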
In [108]:
df.head()
Out[108]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked FamilySize isAlone Title GenderClass AgeGroup
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 2 0 Mr male young
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 2 0 Mrs female middle-aged
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1 Miss female young
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 2 0 Mrs female middle-aged
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 1 Mr male middle-aged


In [109]:
# ML Problem Statement:
# Aim: To predict the survival of passengers based on their features.
columns_to_drop = ['PassengerId', 'Name', "Sex", "SibSp", "Parch", "Ticket"]
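# Drop identifiers and raw columns already represented by engineered features
# (Sex -> GenderClass, SibSp and Parch -> FamilySize, Name -> Title)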
df.drop(columns=columns_to_drop, inplace=True)
# or
# df = df.drop(columns=columns_to_drop)
# Display the first few rows of the cleaned DataFrame
df.head()
Out[109]:
Survived Pclass Age Fare Embarked FamilySize isAlone Title GenderClass AgeGroup
0 0 3 22.0 7.2500 S 2 0 Mr male young
1 1 1 38.0 71.2833 C 2 0 Mrs female middle-aged
2 1 3 26.0 7.9250 S 1 1 Miss female young
3 1 1 35.0 53.1000 S 2 0 Mrs female middle-aged
4 0 3 35.0 8.0500 S 1 1 Mr male middle-aged


In [110]:
columns_to_dummify = ["Embarked", "Title", "GenderClass", "AgeGroup"]
# Creating dummy variables for categorical columns
df = pd.get_dummies(df, columns=columns_to_dummify, drop_first=True, dtype=int)
# drop_first=True drops the first category of each column; that category becomes the
# baseline (all of its dummies are 0), which avoids perfectly collinear dummy columns
# Display the first few rows of the DataFrame after creating dummy variables
df.head()
Out[110]:
Survived Pclass Age Fare FamilySize isAlone Embarked_Q Embarked_S Title_Miss Title_Mr Title_Mrs GenderClass_female GenderClass_male AgeGroup_young AgeGroup_middle-aged AgeGroup_senior
0 0 3 22.0 7.2500 2 0 0 1 0 1 0 0 1 1 0 0
1 1 1 38.0 71.2833 2 0 0 0 0 0 1 1 0 0 1 0
2 1 3 26.0 7.9250 1 1 0 1 1 0 0 1 0 1 0 0
3 1 1 35.0 53.1000 2 0 0 1 0 0 1 1 0 0 1 0
4 0 3 35.0 8.0500 1 1 0 1 0 1 0 0 1 0 1 0
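To see what `drop_first=True` removes, it helps to compare the encoding with and without it on a single small column. This is a minimal sketch on made-up data; the column name `Embarked` matches the notebook, but the values are illustrative.

```python
import pandas as pd

# Toy column with three categories (illustrative values only)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Full one-hot encoding: one column per category, each row sums to 1
full = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# drop_first=True removes the first category ("C"), which becomes the baseline
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

The row for `C` is encoded as all zeros in `reduced`, so no information is lost. The same logic explains why `Embarked_C`, `Title_Master`, `GenderClass_child`, and `AgeGroup_child` do not appear in the output above.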


In [111]:
numerical_cols = ['Age', 'Fare']
# Scale numerical columns - StandardScaler
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
df[numerical_cols] = ss.fit_transform(df[numerical_cols])
# Display the first few rows of the DataFrame after scaling numerical columns
df.head()
Out[111]:
Survived Pclass Age Fare FamilySize isAlone Embarked_Q Embarked_S Title_Miss Title_Mr Title_Mrs GenderClass_female GenderClass_male AgeGroup_young AgeGroup_middle-aged AgeGroup_senior
0 0 3 -0.592481 -0.502445 2 0 0 1 0 1 0 0 1 1 0 0
1 1 1 0.638789 0.786845 2 0 0 0 0 0 1 1 0 0 1 0
2 1 3 -0.284663 -0.488854 1 1 0 1 1 0 0 1 0 1 0 0
3 1 1 0.407926 0.420730 2 0 0 1 0 0 1 1 0 0 1 0
4 0 3 0.407926 -0.486337 1 1 0 1 0 1 0 0 1 0 1 0
In [112]:
df.describe()
Out[112]:
Survived Pclass Age Fare FamilySize isAlone Embarked_Q Embarked_S Title_Miss Title_Mr Title_Mrs GenderClass_female GenderClass_male AgeGroup_young AgeGroup_middle-aged AgeGroup_senior
count 891.000000 891.000000 8.910000e+02 8.910000e+02 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 2.232906e-16 3.987333e-18 1.904602 0.602694 0.086420 0.725028 0.207632 0.603816 0.143659 0.308642 0.603816 0.542088 0.287318 0.083053
std 0.486592 0.836071 1.000562e+00 1.000562e+00 1.613459 0.489615 0.281141 0.446751 0.405840 0.489378 0.350940 0.462192 0.489378 0.498505 0.452765 0.276117
min 0.000000 1.000000 -2.253155e+00 -6.484217e-01 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 -5.924806e-01 -4.891482e-01 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 3.000000 0.000000e+00 -3.573909e-01 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000
75% 1.000000 3.000000 4.079260e-01 -2.424635e-02 2.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
max 1.000000 3.000000 3.870872e+00 9.667167e+00 11.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
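The scaled columns show a standard deviation of 1.000562 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the effect by hand on a few made-up values, without calling scikit-learn.

```python
import numpy as np
import pandas as pd

# Illustrative values only
values = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])

# Standardize by hand with the population standard deviation (ddof=0)
z = (values - values.mean()) / values.std(ddof=0)

pop_std = z.std(ddof=0)     # exactly 1 up to floating-point error
sample_std = z.std(ddof=1)  # what describe() reports
expected = np.sqrt(len(z) / (len(z) - 1))

print(pop_std, sample_std, expected)

# With n = 891 rows the same ratio matches the describe() output above
print(np.sqrt(891 / 890))
```

The ratio between the two is sqrt(n / (n - 1)), which for 891 rows is about 1.000562.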


In [114]:
# Plot the distribution of the scaled numerical columns

# Plotting the distribution of the scaled Age column
sns.histplot(df['Age'], kde=True)
plt.axvline(df['Age'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Age'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Scaled Age with Mean and Median')
plt.xlabel('Scaled Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()
In [115]:
# Plotting the distribution of the scaled Fare column
sns.histplot(df['Fare'], kde=True)
plt.axvline(df['Fare'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Fare'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Scaled Fare with Mean and Median')
plt.xlabel('Scaled Fare')
plt.ylabel('Frequency')
plt.legend()
plt.show()
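In the `describe()` output the scaled Fare has a median (about -0.36) well below its mean (0), which points to a right-skewed distribution. Standardizing shifts and rescales the values but does not change the shape, so the skew remains. The sketch below checks this with `Series.skew()` on a few made-up fares.

```python
import pandas as pd

# Illustrative fares with one very large value
fares = pd.Series([7.25, 7.925, 8.05, 13.0, 26.0, 53.1, 71.2833, 512.3292])

# Standardize by hand (same idea as StandardScaler: zero mean, unit variance)
scaled = (fares - fares.mean()) / fares.std(ddof=0)

raw_skew = fares.skew()
scaled_skew = scaled.skew()

print(f"mean={fares.mean():.2f}, median={fares.median():.2f}")
print(f"skew before scaling={raw_skew:.4f}, after scaling={scaled_skew:.4f}")
```

If a more symmetric feature is needed, a transformation such as a logarithm changes the shape, whereas standard scaling alone does not.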

Happy Learning